A Deep Learning Approach to Safeguard Coral Reefs: Detecting Crown-of-Thorn Starfish in Underwater Footage

Part 2/2: Training & Results

By: Ken K. Hong, Oanh Doan

2. Approach

2.1. Data Processing

In this project, we first performed exploratory data analysis (EDA) to examine the dataset in detail, as described in the background section and Part 1. Next, we addressed the issue of label format incompatibility. The original dataset provided bounding box annotations using the pixel coordinates of the upper-left corner (x_min, y_min), along with the box’s width and height in pixels. We modified this setup to meet YOLO’s required format, which uses the coordinates of the bounding box’s center (x_center, y_center) and normalizes all coordinates and box measurements with respect to the original image’s width and height. Hence, all normalized values are bounded between 0 and 1.
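The conversion described above can be written as a small helper. This is a minimal sketch (the function and variable names are our own, not from the project code), assuming pixel-space inputs:

```python
def to_yolo_bbox(x_min, y_min, box_w, box_h, img_w, img_h):
    """Convert a (x_min, y_min, width, height) pixel box to YOLO's
    normalized (x_center, y_center, width, height) format."""
    x_center = (x_min + box_w / 2) / img_w
    y_center = (y_min + box_h / 2) / img_h
    return x_center, y_center, box_w / img_w, box_h / img_h

# Illustrative values: a 40x30 box at (100, 50) in a 1280x720 frame.
x_c, y_c, w, h = to_yolo_bbox(100, 50, 40, 30, 1280, 720)
```

Because every output is divided by the image width or height, all four values land in [0, 1], as required.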

In addition to label conversion, we filtered the dataset to include only frames containing annotations, since unlabeled frames provide no positive examples for training. Frames without bounding boxes act as negative examples, and in excess they can lead to unbalanced learning, degraded model performance, and computational inefficiency. As shown in Table 1, applying this filter yields a combined dataset of 4919 images across training and validation.

For the training and fine-tuning processes, video0 and video1 (4242 images) are used as the training dataset, and video2 (677 images) is reserved for validation. This dataset splitting approach provides a more reliable evaluation of the object detection model’s performance and helps reduce the risk of overfitting to the training data. Additionally, instead of a random 80/20 split, a video-based split is used to prevent data leakage, as images from the same video sequence show high similarity.
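The filter and the video-based split can be sketched as below. The (video_id, annotations) pairs stand in for rows of the CSIRO metadata CSV; the inline values are made-up illustrations, not real dataset entries:

```python
import json

# Stand-in rows: (video_id, annotations-JSON). An empty list "[]"
# marks a frame with no bounding boxes.
rows = [
    (0, '[{"x": 10, "y": 20, "width": 30, "height": 40}]'),
    (0, "[]"),  # unlabeled frame -> dropped by the filter
    (1, '[{"x": 5, "y": 5, "width": 10, "height": 10}]'),
    (2, '[{"x": 1, "y": 2, "width": 3, "height": 4}]'),
]

# Keep only frames that carry at least one bounding box.
annotated = [(vid, json.loads(ann)) for vid, ann in rows if ann != "[]"]

# Video-based split: video0 + video1 -> training, video2 -> validation.
train = [r for r in annotated if r[0] in (0, 1)]
val = [r for r in annotated if r[0] == 2]
```

Splitting by video rather than by random frame keeps near-duplicate consecutive frames on the same side of the split, which is what prevents the leakage described above.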


Table 1: Data Distribution of Annotated Frames per Video

Video ID | Total Frames | Frames with Bounding Boxes | % of Annotated Frames
0 | 6708 | 2143 | 31.95%
1 | 8232 | 2099 | 25.50%
2 | 8561 | 677 | 7.91%
Total | 23501 | 4919 | 20.93%


2.2. Model

We chose YOLOv11 [4] by Ultralytics, a state-of-the-art object detection model that improves on prior YOLO versions in both accuracy and efficiency. YOLOv11 is particularly well-suited to our task thanks to its real-time inference capabilities and strong performance on small, intricate objects.

We began our experiments with the YOLOv11 nano model, which is optimized for lightweight, efficient training and inference. We trained it for 10 epochs using default hyperparameters and an input image size of 640, reaching an mAP50 of 0.3, a promising signal of the architecture's potential. Building on this result, we adopted YOLOv11 small as our baseline and experimented with larger variants, including YOLOv11 medium, large, and X-large. We hypothesized that larger models, with their deeper and more complex layers, would better capture low-level details and thus improve performance. This matters for our task in particular, where the starfish blend seamlessly into the coral reef and have subtle features that are difficult to distinguish.

Apart from model size, we also tested various hyperparameters, including cosine learning rate scheduling and the AdamW optimizer (Adam with decoupled weight decay). The hyperparameter tuning experiments are summarized in Table 3.

Beyond tuning hyperparameters to optimize the baseline model, we attempted two techniques to further enhance model performance:

  1. Dummy Classes for Regularization: We added dummy classes to the output layer as a form of regularization. We magnified the number of prediction classes, but no actual labels were created for these dummy classes. We expected this technique to reduce overfitting and force the model to generalize better to unseen data.

  2. Integration of CBAM (Convolutional Block Attention Module): We integrated CBAM to refine the feature extraction process. Specifically, we inserted CBAM into the first two C3K2 layers in the YOLO model, which are part of the model backbone and responsible for feature extraction. With this update, CBAM processes the output of the original C3K2 layers to suppress irrelevant information before passing it on to the next layer.

    We chose to insert CBAM into only the first two C3K2 layers for two main reasons:

    • They operate at early stages of the network, where broader and lower-level features are extracted.
    • The architecture of these two layers is the same, which reduces friction in implementation.

We opted not to insert CBAM into deeper layers to minimize computational and implementation overhead, particularly since those layers focus on higher-level, abstract features, where attention refinement is less critical.
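The module described above can be sketched as a minimal PyTorch implementation of CBAM following Woo et al. [6]. This is an illustrative standalone version, not the project's exact code; the wiring into Ultralytics' C3K2 layers is omitted:

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Squeeze spatial dims, weight channels via a shared MLP."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))  # global average pool
        mx = self.mlp(x.amax(dim=(2, 3)))   # global max pool
        scale = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * scale


class SpatialAttention(nn.Module):
    """Pool across channels, weight spatial locations via a conv."""

    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        scale = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * scale


class CBAM(nn.Module):
    """Channel attention followed by spatial attention (Woo et al., 2018)."""

    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.channel = ChannelAttention(channels, reduction)
        self.spatial = SpatialAttention(kernel_size)

    def forward(self, x):
        return self.spatial(self.channel(x))
```

In our setup, a module like this would take the output of a C3K2 block and return a tensor of the same shape, so it can be dropped in without changing downstream layer dimensions.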

Data Preparation & Train/Val Split

3. Experiments and Results

3.1. Experiment Setup

Computational resources are a key limiting factor in this project. The maximum GPU available on Google Colab (A100 with 40 GB memory) is insufficient to train models with large image sizes, batch sizes, and complex model configurations. While the most systematic approach to tune hyperparameters is to utilize the model.tune() function provided by Ultralytics, we couldn't pursue this direction due to resource constraints. Instead, we decided to override several hyperparameters with selected values based on our research. These values are detailed in Table 2.

We used these values as the new ‘default’ for subsequent tuning of hyperparameters, as shown in Table 3, except for Model 5b.

Initially, we tested a baseline YOLOv11 small model using an image size of 640 and a batch size of 8. Given the nature of our task, it is intuitive that higher image resolutions allow the model to learn finer details that might otherwise be lost with lower resolution images. Therefore, we increased the image size to 1280 for all subsequent experiments.

After evaluating the baseline model, we fine-tuned the models by varying batch sizes and introducing dummy classes. Batch size impacts gradient stability during training. Smaller batch sizes may lead to noisier updates, while larger batch sizes provide more stable gradients but demand more GPU memory. Dummy classes were introduced to mitigate overconfidence in the model's predictions. Finally, we incorporated CBAM into the best-performing model configuration.
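The dummy-class idea amounts to inflating the class list in the dataset config while leaving only class 0 with real labels. A hypothetical sketch (paths and names are illustrative assumptions, not the project's actual config):

```python
# Only class 0 ("COTS") has real labels; the extra names inflate the
# prediction head as a form of regularization.
n_dummy = 10
names = {0: "COTS", **{i: f"dummy_{i}" for i in range(1, n_dummy + 1)}}

data_cfg = {
    "path": "datasets/cots",   # assumed dataset root
    "train": "images/train",   # frames from video0 + video1
    "val": "images/val",       # frames from video2
    "names": names,            # 1 real class + 10 dummy classes
}
```

Because no label file ever references classes 1 through 10, the model must learn to assign them near-zero probability everywhere, which is the intended pressure against overconfident predictions.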

3.2. Experiment Results

Table 3 presents the experiment results. We use two main metrics to evaluate model performance: mAP50, the mean average precision at an IoU threshold of 0.5, and mAP50-95, which averages precision over IoU thresholds from 0.5 to 0.95 and therefore rewards tighter localization.

Key Findings:

After evaluating various configurations, the best-performing model was Model 1, a YOLOv11 small model with a 1280-pixel image size and a batch size of 8, trained for 20 epochs. This model achieved the highest mAP50 (0.729) and mAP50-95 (0.331) among the models trained with our overridden hyperparameters, while consuming significantly fewer computational resources than the larger configurations.

The incorporation of CBAM into this model (creating Model 5a) resulted in no learning (mAP50 close to 0) and early termination after only 8 epochs. We performed a binary search to identify which overridden hyperparameters caused the model to degrade. The problem was traced to:

  1. Adam Weight Decay
  2. Cosine Learning Rate Scheduler
  3. Tuned Augmentation Hyperparameters

We hypothesized that the large gradient updates from AdamW in the early stages of training, combined with extreme input distortion, disrupted CBAM's learning process. By removing the overridden hyperparameters, we returned to YOLO’s default configuration with CBAM (Model 5b), which achieved results comparable to Model 1.

Several factors may explain why CBAM did not lead to a substantial improvement; as discussed in the conclusion, our constrained computational budget limited model size, input resolution, batch size, and training length, which may have prevented the attention module from realizing its full benefit.

Thus, the best-performing model in our experiments remains Model 1: YOLOv11 small trained with an image size of 1280 pixels and a batch size of 8 for 20 epochs.

Model Performance and Training Curves

The Results section provides the training and validation curves for Model 1. Both loss curves show a consistent downward trend, while the accuracy metrics improve steadily as training progresses. One notable pattern is a sudden increase in training losses over the last two epochs, which coincides with mosaic augmentation being disabled once close_mosaic takes effect (reduced from its default of 10 epochs to 2). This adjustment did not hurt overall performance: the validation losses and accuracy metrics remained stable, confirming that the model was neither overfitting nor underperforming.

In terms of inference speed, the model processes each image in 7.0 ms.

This corresponds to approximately 143 frames per second (1000 ms ÷ 7.0 ms ≈ 143 FPS), making the model highly suitable for real-time underwater COTS detection. The Results section visualizes the model's predictions for a sample frame.


Table 2: Overridden hyper-parameters

Hyperparameter | Training Value | Default | Reason for Overriding
imgsz | 1280 | 640 | Higher-resolution images allow learning of subtle patterns.
epochs | 20 | 100 | Reduced due to resource constraints.
patience | 3 | 100 | Early-stopping criterion.
cache | True | False | Speed up data loading during training.
workers | 2 | 8 | Prevent memory issues caused by data loading.
optimizer | AdamW | auto | Improved weight regularization.
cos_lr | True | False | Use a cosine learning rate scheduler.
close_mosaic | 2 | 10 | Disable mosaic augmentation in the last N epochs; matches the reduced epoch count.
lrf | 0.0001 | 0.01 | Final learning rate = lr0 × lrf = 1e-6, allowing smoother convergence.
warmup_epochs | 2 | 3 | Reduced warmup to match the reduced total epochs.
nbs | 8 or 16 | 64 | Reduced due to resource constraints.
dropout | 0.3 | 0 | Reduce overfitting given the small COTS dataset.
degrees | 160 | 0.0 | Increase image rotation augmentation.
translate | 0.5 | 0.1 | Increase variability in object positions.
shear | 90 | 0.0 | Increase shear augmentation.
flipud | 0.1 | 0.0 | Probability of flipping the image upside down.
bgr | 0.01 | 0.0 | Probability of swapping image channels from RGB to BGR.
mixup | 0.2 | 0.0 | Probability of blending images and labels, enhancing generalizability.
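The Table 2 overrides translate into an Ultralytics training call roughly like the sketch below. The dataset YAML name and weights file are assumed placeholders, and the actual call is commented out since it requires the ultralytics package and the dataset:

```python
# Table 2's overridden hyperparameters collected as keyword arguments.
overrides = dict(
    imgsz=1280, epochs=20, patience=3, cache=True, workers=2,
    optimizer="AdamW", cos_lr=True, close_mosaic=2, lrf=0.0001,
    warmup_epochs=2, dropout=0.3, degrees=160, translate=0.5,
    shear=90, flipud=0.1, bgr=0.01, mixup=0.2,
)

# Training call sketch ("cots.yaml" and the weights name are assumptions):
# from ultralytics import YOLO
# YOLO("yolo11s.pt").train(data="cots.yaml", batch=8, **overrides)
```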


Table 3: YOLO v11 Fine-Tuning Results
Best Model * | CBAM w/ default settings †

Model | Model Size | CBAM | Epochs | Image Size | # Dummy Classes | Batch Size | mAP50 | mAP50-95 | GPU Usage (Used/Total GB)
Baseline | Small | -- | 20 | 640 | 0 | 8 | 0.31 | 0.133 | 2.5/22.5
1 | Small | -- | 20 | 1280 | 0 | 8 | 0.729* | 0.331 | 9.1/22.5
2 | Small | -- | 20 | 1280 | 10 | 8 | 0.701 | 0.331 | 9.1/22.5
3 | Small | -- | 20 | 1280 | 0 | 16 | 0.722 | 0.339 | 16.5/22.5
4 | Small | -- | 20 | 1280 | 10 | 16 | 0.706 | 0.328 | 16.5/22.5
5a | Small | Y | 20 | 1280 | 0 | 8 | 0 | 0 | Out-of-Memory
5b | Small | Y | 20 | 1280 | 0 | 16 | 0.744† | 0.392 | 17.5/22.5
6 | Medium | -- | 20 | 1280 | 0 | 16 | 0.721 | 0.330 | 32.5/40.5
7 | Medium | -- | 20 | 1280 | 5 | 16 | 0.698 | 0.321 | 32.5/40.5
8 | Medium | -- | -- | 1280 | 0 | 32 | -- | -- | Out-of-Memory
9 | Medium | -- | -- | 1920 | 0 | 32 | -- | -- | Out-of-Memory
10 | Large | -- | 20 | 1280 | 0 | 16 | 0.674 | 0.306 | 32.5/40.5
11 | Large | -- | -- | 1280 | 0 | 32 | -- | -- | Out-of-Memory
12 | X-Large | -- | -- | 1280 | 0 | 16 | -- | -- | Out-of-Memory




Conclusion:

In this project, we developed a deep learning model for real-time Crown-of-Thorns Starfish (COTS) detection using the state-of-the-art YOLOv11 architecture from Ultralytics, combined with the CSIRO Crown-of-Thorns Starfish Detection dataset. While we explored enhancements such as CBAM (Convolutional Block Attention Module) and the introduction of dummy classes, these methods did not yield significant improvements during the fine-tuning process. However, it is possible that with increased computational resources, a larger YOLOv11 model, higher input image sizes, larger batch sizes, and extended training epochs, these techniques could prove more effective.

Our best-performing model achieves an mAP50 of 0.729 and an inference speed of 7.0 ms per image (approximately 143 frames per second). It comprises 238 layers and 9,413,187 parameters, with a computational cost of 21.3 GFLOPs. These figures indicate that the model is well suited for real-time underwater COTS detection and could significantly improve the efficiency of COTS removal efforts, contributing to the ongoing protection of coral reefs.

References

[1] Australian Institute of Marine Science. Reef monitoring sampling methods, n.d. Accessed: 2024-12-07.

[2] Jiajun Liu, Brano Kusy, Ross Marchant, Brendan Do, Torsten Merz, Joey Crosswell, Andy Steven, Nic Heaney, Karl von Richter, Lachlan Tychsen-Smith, David Ahmedt-Aristizabal, Mohammad Ali Armin, Geoffrey Carlin, Russ Babcock, Peyman Moghadam, Daniel Smith, Tim Davis, Kemal El Moujahid, Martin Wicke, and Megha Malpani. The CSIRO Crown-of-Thorn Starfish Detection Dataset, 2021.

[3] Ultralytics Team. Hyperparameter tuning guide. https://docs.ultralytics.com/guides/hyperparameter-tuning/#what-are-hyperparameters, 2024. Accessed: 2024-12-09.

[4] Ultralytics. YOLOv11: Object detection and image segmentation models. https://docs.ultralytics.com/models/yolo11/, 2024.

[5] UNESCO World Heritage Centre. Great Barrier Reef, n.d. Accessed: 2024-12-07.

[6] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional Block Attention Module, 2018.